Using OpenAI's GPT-3 to Generate 'Doctor Who' Episode Synopses
You can now track your GPT-3 fine-tunes with Weights & Biases! Here's how to do just that, but with more David Tennant than you were otherwise expecting
Created on February 6 | Last edited on March 25
Introduction
Coming up with sci-fi stories is no easy task. You need the creativity to imagine impossible things and the ability to build realistic worlds with special rules, not to mention the narrative skill to keep these weird and complex plots driving towards a satisfying (or terrifying) conclusion. Most of us can't do this. Science fiction, as it turns out, is hard to write.
So what if, instead of spinning a new story from whole cloth, we looked towards some existing intellectual property? What if we could use AI to help generate new sci-fi stories? And not just generate them for any old show, but for the iconic British television show Doctor Who? How exactly might that look?
We've got the answer in this very report.
Today, we're going to take OpenAI's awesome GPT-3 model for a spin and explore the training metrics, model predictions, and, importantly, some episode synopses the model's generated based on a scant 300 episode summaries. This is all part of our collaboration with the folks at OpenAI.
The particulars here are pretty simple, though we'll explain them a bit more below. The important thing to know up front is this: we're using OpenAI's APIs to fine-tune GPT-3 on those aforementioned Doctor Who synopses. You'll see how it worked later in the report, but we wanted to start with something fun:
During the course of this tutorial, we'll be tossing in a few pairs of Doctor Who episode summaries. One is real; one is generated by GPT-3 based on our fine-tune and a real episode title. We've obscured the answer so it feels a little bit more like a trivia question. In fact, let's start there, shall we?

Real or Fake #1:
Title: "The Dalek Invasion of Earth"
Option A) The Daleks are back, in the form of a deadly new breed of robot, determined to exterminate humanity and conquer the planet Earth. Led by the evil and despotic Thal, the Daleks have spread their evil empire across the galaxy, with their most powerful forces based on Earth. They are ready to pounce. The Doctor, Polly and Ben are all in their own way in the way of the Daleks and are in danger of being exterminated. With the TARDIS immobilised, the only hope of rescue seems to lie with the Doctor’s oldest and most deadly enemy, the Cybermen.
But can the Doctor overcome his hatred for the Cybermen and work with them to defeat the Daleks?
Option B) The TARDIS materialises beside the River Thames in the year 2167. The world to which they emerge, however, is eerily silent and signs of decay are rife. They soon discover that the streets are patrolled by zombie-like humans, flying saucer-like craft occupy the skies and that desperate rebels are hiding below ground in the disused Underground system. But worse is still to come! The Daleks, thought destroyed by the Doctor, have invaded and enslaved humanity, having first bombarded it with cosmic storms. Why have the Daleks chosen Earth in particular? What is the significance of the giant mine in Bedfordshire to which they are shipping human slaves?
And can the Doctor unite with the small band of resistance fighters to stop the Daleks before their plans wreak havoc with the Earth forever?
<--- Open the toggle to see the answer
OpenAI & Weights & Biases
If you're new to either W&B or OpenAI, here's a quick primer on both:
Weights & Biases is the world's leading, developer-first MLOps platform. With just a few lines of code, you can instantly debug, compare, and reproduce your models – that means architecture, hyperparameters, git commits, model weights, GPU usage, datasets, and predictions. We're trusted by over a hundred thousand ML practitioners at some of the most innovative companies and research organizations around today, including OpenAI.
OpenAI, meanwhile, is a global research leader in the machine learning space. Their mission is to ensure that artificial general intelligence benefits all of humanity, and they're famous for their cutting-edge research in robotics, music and text generation, and a whole lot more.
This particular collaboration involves OpenAI's state-of-the-art language generation model, GPT-3.
Dataset Collection
GPT-3 is capable of producing genuinely impressive text out of the box. But fine-tuning it for a specific domain is what we're interested in today. OpenAI's fine-tuning documentation states we need at least a few hundred samples for optimal performance, and that the more data we have for training, the better.
Fine-tuning performs better with more high-quality examples. To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to linearly increase with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.
Lucky for us, the show's been around for 60 years and has over 800 episodes. With the help of a simple scraper, I was able to gather 304 episode synopses across the show's history.
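(For the curious: a scraper for this sort of job doesn't need to be fancy. Here's a minimal sketch using requests and BeautifulSoup – the URL and CSS selectors are hypothetical placeholders, since every episode guide structures its pages differently.)
import requests
from bs4 import BeautifulSoup

def scrape_synopsis(url):
    """Fetch one episode page and pull out its title and synopsis."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # The selectors below are made up – inspect your source page for the real ones
    title = soup.select_one("h1.episode-title").get_text(strip=True)
    synopsis = soup.select_one("div.synopsis").get_text(strip=True)
    return {"prompt": title, "completion": synopsis}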
NB: We were especially impressed with how well the fine-tuning of just 304 examples performed. The model picked up on character names, show nuances, and even British spellings of words like "materialises" in the example above.
Here's the dataset I've collected, visualized as a W&B Table – our tool for exploring tabular data and model predictions. I'd recommend checking it out, but not too closely, as you might spoil our little "guess the real synopsis" game. The examples below are the data we used for our fine-tune.
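If you'd like to log a table like this yourself, it only takes a couple of lines. A quick sketch, assuming your synopses live in a pandas DataFrame with "prompt" and "completion" columns:
import pandas as pd
import wandb

# Log the raw dataset as a W&B Table so it can be explored in the UI
df = pd.read_csv("dw_synopses.csv")
run = wandb.init(project="GPT-3 to Generate Doctor Who Synopses")
run.log({"dataset": wandb.Table(dataframe=df)})
run.finish()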
Exploring our Training Data with W&B Tables
A powerful thing about W&B Tables is that they let us explore our dataset by composing queries with StringOps. For example, here's a W&B Table where we search for all episode synopses that contain the word "Dalek" – one of the most notorious monsters in Doctor Who.
Note: In this particular example, we're counting how many episode synopses contain the word Dalek, not how many episode titles do. And we can see that 21 episode synopses contain that word.
Of course, finding which entries of your dataset contain the name of a deadly sci-fi monster might not be what you're looking to do in your particular use case. You could do something very similar if you were looking for words like "refund" or "dangerous" in a series of customer success inquiries. We just assumed it would be more enjoyable to explore our GPT-3 logging with something less business-y.
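If you'd rather run the same kind of query locally before logging anything, the pandas equivalent is a one-liner. A sketch, assuming the same two-column CSV as above:
import pandas as pd

df = pd.read_csv("dw_synopses.csv")
# Count synopses (not titles) that mention "Dalek" – this mirrors the StringOps query above
dalek_mask = df["completion"].str.contains("Dalek", na=False)
print(f"{dalek_mask.sum()} episode synopses mention the Daleks")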
Real or Fake #2:
Episode title: "Kill the Moon"
Option A) In the near future, the Doctor and Clara find themselves on a space shuttle making a suicide mission to the Moon. Crash-landing on the lunar surface, they find a mining base full of corpses, vicious spider-like creatures poised to attack, and a terrible dilemma. When Clara turns to the Doctor for help, she gets the shock of her life.
Option B) Earth has been invaded, and the Doctor and Rose fight their way through the Nazis to reach the Moon. There they discover an even greater enemy, and one that lies hidden in the heart of the satellite. Who is the Doctor’s old friend, the Master? And what is the secret behind the strange blue box? The Doctor’s quest for the answer takes him to the end of time itself – and straight into the heart of an impossible mystery.
<--- Open the toggle to see the answer
Fine-Tuning GPT-3 on Custom Data Using the OpenAI API
Fine-tuning any model on custom data lets it make better predictions in a specific domain, and GPT-3 is no exception. It's important to note that there are many tasks for which fine-tuning a GPT-3 model can be useful – things like classification, conditional generation, or open-ended generation. Which is to say that this report is meant as a walkthrough of our collaboration, not the end-all, be-all. That said: since our goal is to input a potential episode title into the model and get a predicted synopsis back, we'll be dealing with conditional generation in this tutorial.
To fine-tune a GPT-3 model on a custom dataset for conditional generation we need to do the following:
- Collect and preprocess our dataset with the OpenAI CLI tool.
- Configure which model we'll use (there are different variations of GPT-3 available for fine-tuning, like "Ada", "Babbage" or "Curie") and with which hyperparameters, then begin training.
- Sync OpenAI API fine-tune jobs data with Weights & Biases - with literally one line of code - where it can be explored inside an interactive dashboard. No really, it's just one line of code:
$ openai wandb sync
- Here's a screenshot of what mine looks like:

- And now the fun part: perform inference either on validation data or on any other data you want, using the OpenAI API in Python or OpenAI's Playground. That's the part where we can put in any episode name we want and get a synopsis for it.
Real or Fake #3:
Episode title: "The Fires of Pompeii"
Option A) The Doctor and Donna travel back into ancient history. When they arrive in 79AD, however, they discover psychic powers and beasts of stone running riot in the streets of old Pompeii. The time-travellers face their greatest challenge yet – can established history be changed, or must the Doctor let everyone die?
Option B) On the anniversary of the eruption of Mount Vesuvius, the Doctor and Romana receive a distress call from the planet Pompeii. The call is followed by a radio silence and the Doctor’s assistant, Leela, goes to investigate. What she discovers is a city in which the entire population has vanished. A city that is also winking out of existence!
The Doctor and Romana join forces with Senator to investigate the disaster. Aided by a visit from the Time Lord’s old friend, Adric, they discover that the city’s destruction is part of a sinister plan to cover up evidence of the volcano’s rebirth as a sentient being.
<--- Open the toggle to see the answer
Preprocessing our Dataset
All of the steps described in this report are in the Colab notebook I've put together and linked below. To follow along and fine-tune your own model, all you need is an OpenAI API account and a Weights & Biases account. Here's that link:
You can start off by pasting your API key to authenticate in the Colab notebook.
# Enter credentials
%env OPENAI_API_KEY=
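The CLI reads the key from that environment variable. If you also want to call the API from Python later in the notebook, the openai client (the v0.x version used in this report) can pick it up the same way – a quick sketch:
import os
import openai

# The openai Python client reads OPENAI_API_KEY from the environment by default;
# setting it explicitly works too
openai.api_key = os.environ["OPENAI_API_KEY"]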
Then we download the raw .csv dataset I've put together, with the synopses stored in my account as a Weights & Biases Artifact.
import wandb

run = wandb.init(project='GPT-3 to Generate Doctor Who Synopses')
artifact = run.use_artifact('ivangoncharov/GPT-3 to Generate Doctor Who Synopses/dw_synopses_csv:v0', type='raw_dataset')
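Note that use_artifact only declares the dependency; to actually get the file, download the artifact and load the CSV. A sketch, assuming the file inside the artifact is named dw_synopses.csv, as in the CLI step below:
import pandas as pd

# Download the artifact's contents locally and load the raw dataset
artifact_dir = artifact.download()
df = pd.read_csv(f"{artifact_dir}/dw_synopses.csv")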
Next up, we'll pass our dataset through OpenAI's CLI data preparation tool. Here's what the OpenAI docs say about the tool:
This tool accepts different formats, with the only requirement that they contain a prompt and a completion column/key. You can pass a CSV, TSV, XLSX, JSON or JSONL file, and it will save the output into a JSONL file ready for fine-tuning, after guiding you through the process of suggested changes.
In my case, I have a simple .csv file with two columns: "prompt" and "completion." If you want to plug in your own data, you can create a similar file and pass it through the tool.
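If you're starting from your own raw data, building that file is straightforward. A sketch with made-up example rows – swap in your own (title, synopsis) pairs:
import pandas as pd

# Hypothetical example rows – replace with your own data
episodes = [
    ("Blink", "Sally Sparrow finds cryptic messages hidden behind the wallpaper..."),
    ("Midnight", "A sightseeing tour across a diamond planet goes horribly wrong..."),
]

pd.DataFrame(episodes, columns=["prompt", "completion"]).to_csv("dw_synopses.csv", index=False)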
!openai tools fine_tunes.prepare_data -f dw_synopses.csv
The tool will guide you through reformatting your dataset. In my case, the main changes are that an arrow (" ->") is appended to every episode title, and that every completion gets a leading space plus a trailing " END" token (with a space, too), to help the model better understand where prompts and completions begin and end.
It looks something like this:
{"prompt":"Frontier in Space ->","completion":" When the identity of the far deadlier foe waiting in the wings… END"}
After that, we just have to split the JSONL file into the training and validation sets:
# The dataset has 304 pairs in total
!head -n 274 dw_synopses_prepared.jsonl > dw_train.jsonl
!tail -n 30 dw_synopses_prepared.jsonl > dw_valid.jsonl
Next, we'll define the type of model we'll use and its hyperparameters. Check out this section in the OpenAI docs to learn about the models available for fine-tuning, as well as their strengths and weaknesses.
model = 'curie'  # can be ada, babbage or curie
n_epochs = 4
batch_size = 4
learning_rate_multiplier = 0.1
prompt_loss_weight = 0.1
That's it! It really is that simple.
Which means we're now ready to jump into training. All we have to do is pass along our data and the hyperparameters. Of course, since the training happens in the cloud on OpenAI's end, we don't have to worry about having a giant GPU sitting around, or even about using a GPU runtime on Colab. The fine-tuning process on this data didn't take more than 15 minutes for me.
!openai api fine_tunes.create \
    -t dw_train.jsonl \
    -v dw_valid.jsonl \
    -m $model \
    --n_epochs $n_epochs \
    --batch_size $batch_size \
    --learning_rate_multiplier $learning_rate_multiplier \
    --prompt_loss_weight $prompt_loss_weight
Ok. One more synopsis-guessing game:
Real or Fake #4:
Title: "Invasion of the Dinosaurs"
Option A) The TARDIS brings the Doctor and his companions to a junkyard planet where a crashed spaceship from the future contains the remains of dinosaurs that have been petrified by a meteorite impact. The petrified dinosaurs are then used by a group of alien mercenaries, led by a woman named Voord, to attack the junkyard planet in order to recover a device that can modify their leader’s mind into that of a dinosaur.
Option B) Returning to London the Doctor and Sarah find a city almost completely devoid of life. The civilian population has been evacuated in the wake of an unimaginable event: somehow Dinosaurs have returned to terrorise the Earth.
<--- Open the toggle to see the answer
Syncing OpenAI API Fine-Tunes with Weights & Biases
Now, after our model is fine-tuned successfully, we'll get a message telling us the name of our fine-tuned model. In this case, it's curie:ft-wandb-2022-02-06-22-15-01.
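Training runs asynchronously on OpenAI's side, so if your Colab stream disconnects before that message appears, you can list your fine-tunes and re-attach to the job. With the same v0.x CLI used above, something like this should do it:
# List all your fine-tunes, then re-attach to a running job by its ID
!openai api fine_tunes.list
!openai api fine_tunes.follow -i ft-xxxxxxxxx  # substitute your own fine-tune ID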

We can go ahead and play around with it, and it might even work. But what happens when we want to add more data and see whether it helps, or try a different model configuration and keep track of how it does?
That's where the OpenAI and Weights & Biases collaboration comes into play! With just one line of code we can sync all of our fine-tunes data from OpenAI to our W&B dashboard.
!openai wandb sync --project "GPT-3 to Generate Doctor Who Synopses"
And now we can see the fine-tune details and metadata logged to W&B. On the left, for example, I've set up a view for myself that lets me see at first glance which fine-tune was run on which model, what training data was used, and what the number of epochs was (16, 10, or 4 in the screenshot).

On the right, you can see all of the training metrics pulled automatically into Weights & Biases for each fine-tune. This means that no matter how many fine-tunes you do, you can always easily compare which of those perform the best.
Here's what the training charts look like:
Additionally, as shown above, I've pinned some of the hyperparameters to be displayed in the project dashboard, to give more context behind the naming of the fine-tunes.
We can also find all of the hyperparameters and metadata of the model and the fine-tune inside the run page itself:

This is really important for future reproducibility: we can see which models work the best and what hyperparameters they use to do so, as well as what data they were trained on.
Not bad for one line of code, huh?
Last Real or Fake Synopsis Guess:
Episode title: "The Husbands Of River Song"
Option A) It’s Christmas Eve, and the Doctor is waiting for his wife, Song, to wake from her regenerative sleep. But as he sits with her, the newly-regenerated Doctor begins to doubt the truth of their marriage. Is the Song he knows real, or has she been taken over by the River Song, an alien creature of immense power, who will stop at nothing to keep her hold on the Doctor?
As the Doctor investigates, he comes upon an old friend, the Judoon, and an even older enemy, the Cybermen. What is the link between these seemingly unconnected events? And what is inside the snow globe that the Cybermen are so interested in?
Option B) It’s Christmas Day on a remote human colony and the Doctor is hiding from Christmas Carols and Comedy Antlers. But when a crashed spaceship calls upon the Doctor for help, he finds himself recruited into River Song’s squad and hurled into a fast and frantic chase across the galaxy. King Hydroflax is furious, and his giant Robot bodyguard is out of control and coming for them all!
Will Nardole survive? And when will River Song work out who the Doctor is? All will be revealed on a starliner full of galactic super-villains and a destination the Doctor has been avoiding for a very long time.
<--- Open the toggle to see the answer
Using Weights & Biases to Log and Explore Model Predictions on Validation Data
The best way to evaluate the performance of generative models is to look at their predictions on data they haven't seen during training – a.k.a. the validation data.
With Weights & Biases, we can log model predictions on the validation data as a W&B Table and easily explore them visually.
We start off by creating an evaluation job using wandb - the Weights & Biases client:
# create eval job
run = wandb.init(project='GPT-3 to Generate Doctor Who Synopses', job_type='eval')
entity = wandb.run.entity
Okay, so here's a very neat part: because the metadata for our fine-tunes was synced to Weights & Biases, we can look up the metadata for the last fine-tuned model and grab the fine-tune ID (such as ft-xxxxxxxxx), which we need in order to call the OpenAI API in Python and make predictions.
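Here's a sketch of that lookup using the public wandb.Api(). The exact config key is an assumption on my part – inspect the config of a synced run in your own project to confirm it:
import wandb

api = wandb.Api()
runs = api.runs(f"{entity}/GPT-3 to Generate Doctor Who Synopses")

# Grab the most recent synced run that recorded a fine-tuned model
# (the "fine_tuned_model" key name is assumed here)
fine_tuned_model = next(
    r.config["fine_tuned_model"] for r in runs if r.config.get("fine_tuned_model")
)
print(fine_tuned_model)  # e.g. curie:ft-wandb-2022-02-06-22-15-01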
Next, we specify the number of validation samples for which we want to log predictions. In our case, we use all 30 of them.
n_samples = 30
df = df.iloc[:n_samples]  # df holds the validation set (e.g. loaded from dw_valid.jsonl)
And, finally, we start a loop iterating over the validation prompts and outputting the GPT-3 model completions.
import openai
from tqdm import tqdm

data = []
for _, row in tqdm(df.iterrows()):
    prompt = row['prompt']
    res = openai.Completion.create(model=fine_tuned_model, prompt=prompt, max_tokens=300, stop=[" END"])
    completion = res['choices'][0]['text']
    completion = completion[1:]  # remove initial space
    prompt = prompt[:-3]  # remove " ->"
    target = row['completion'][1:-4]  # remove initial space and " END"
    data.append([prompt, target, completion])
Note: In this code, we also strip the special formatting added for fine-tuning – the " ->" suffix on prompts and the " END" token on completions – so that our data looks clean once we explore it visually.
Next up, we can define and log a W&B Table in two lines of code:
prediction_table = wandb.Table(columns=['prompt', 'target', 'completion'], data=data)
wandb.log({'predictions': prediction_table})
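Once the table is logged, it's good practice to close the eval run so it's marked as finished in the UI:
# Mark the eval run as finished
wandb.finish()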
Here are the predictions I logged as a W&B Table. Note again that our model has not seen the "target" column below:
And don't forget: the cool thing about W&B Tables (as I showed you at the beginning of this report) is that you have access to sorting and StringOps to analyze your data in whatever way works best.
Performing Inference on New Data with Fine-Tuned GPT-3
As you saw with the validation dataset example, once you've finished analyzing the training and validation data in Weights & Biases and found a model you like, you can use much the same logic to run the model on new data with OpenAI's API in Python.
res = openai.Completion.create(model=fine_tuned_model, prompt=prompt, max_tokens=300, stop=[" END"])
completion = res['choices'][0]['text']
completion = completion[1:]  # remove initial space
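For example, to generate a synopsis for a brand-new title (the one below is made up), remember to append the same " ->" separator that the data-preparation tool added during fine-tuning:
import openai

new_title = "The Clockwork Planet"  # hypothetical episode title
res = openai.Completion.create(
    model=fine_tuned_model,
    prompt=f"{new_title} ->",
    max_tokens=300,
    stop=[" END"],
)
print(res['choices'][0]['text'].lstrip())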
Another way to have some fun and play around with your model's predictions in a nice UI is to use OpenAI's Playground.
Same logic here: first, for the GPT-3 model configuration that you think is performing best, get the name of the fine-tune:

And then specify that name in OpenAI's Playground, put in some of your own data, and have fun!
Here's an episode of Doctor Who titled "GPT-3".

Resources for Further Learning About Fine-Tuning GPT-3 with Weights & Biases and the OpenAI API
- Boris Dayma has a great report (and corresponding Colab notebook with it) about fine-tuning GPT-3 on a Wikipedia dataset.
Conclusion
In this report, we learned how to fine-tune a GPT-3 model on our own data using OpenAI's API, and how to understand the state of our fine-tunes using Weights & Biases.
Hope you had fun with the quiz along the way – I know I did putting it together. Please let me know how many you got right, which ones fooled you, and whether you've watched the show before.
Also, before I go: if any Doctor Who writers are reading this and like the sound of some of the generated synopses featured here, just so you know, I am available at competitive rates.